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ABSTRACT 

Babesia spp. are tick-borne, intraerythrocytic 
hemoparasites that use antigenic variation to resist 
host immunity, through sequential modification of 
the parasite-derived variant erythrocyte surface anti- 
gen (VESA) expressed on the infected red blood cell 
surface. We identified the genomic processes driving 
antigenic diversity in genes encoding VESA (ves1) 
through comparative analysis within and between 
three Babesia species, (B. bigemina, B. divergens 
and B. bovis). Ves1 structure diverges rapidly after 
speciation, notably through the evolution of short- 
ened forms (ves2) from 5 ends of canonical ves1 
genes. Phylogenetic analyses show that ves1 genes 
are transposed between loci routinely, whereas ves2 



genes are not. Similarly, analysis of sequence mo- 
saicism shows that recombination drives variation in 
ves1 sequences, but less so for ves2, indicating the 
adoption of different mechanisms for variation of the 
two families. Proteomic analysis of the B. bigemina 
PR isolate shows that two dominant VESA1 proteins 
are expressed in the population, whereas numerous 
VESA2 proteins are co-expressed, consistent with 
differential transcriptional regulation of each family. 
Hence, VESA2 proteins are abundant and previously 
unrecognized elements of Babesia biology, with evo- 
lutionary dynamics consistently different to those of 
VESA1, suggesting that their functions are distinct. 
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INTRODUCTION 

Antigenic variation in pathogens is an adaptation to pro- 
tective host immunity. It describes the serial replacement 
of variant antigens situated on the pathogen surface dur- 
ing an infection, causing the negation of immune responses 
mounted to the preceding antigen. Ultimately, the host can- 
not gain effective immunity and so suffers persistent infec- 
tion and recurrent disease. Antigenic variation is a tangible 
expression of the coevolutionary arms race between hosts 
and pathogens and has evolved on several occasions in di- 
verse lineages (1). This includes the etiological agents of 
prominent diseases of humans and animals, such as malaria 
(2), trypanosomiasis (3,4), influenza (5), bacterial meningi- 
tis (6) and anaplasmosis (7). Since antigenic variation is fun- 
damental to circumventing the immune response, it plays a 
central role in pathogenesis and virulence and is one of the 
main obstacles to developing effective vaccines (8-10). 

This study investigates antigenic variation in Babesia 
spp., apicomplexan hemoparasites that cause a tick-borne 
disease in animals and occasionally in people (11). Babesio- 
sis has a global distribution, and while found predomi- 
nantly in tropical and sub-tropical regions, it frequently 
reaches far into temperate regions (12). Bovine babesiosis 
has a significant negative impact on livestock productivity 
in both developed and developing countries. While most in- 
fections are asymptomatic, mild infections can cause fever 
and diarrhoea while severe infections, of the type caused by 
Babesia bovis in cattle, lead to acute haemolytic anaemia 
and may trigger a cerebral disease that is likely inflamma- 
tory in origin and almost invariably terminal. Here we ex- 
amine two additional species that cause babesiosis in live- 
stock, Babesia bigemina (13) and Babesia divergens (14). 
These three species are members of the Babesia sensu stricto 
clade and, of the three, B. bigemina and B. divergens are most 
closely related (15). Babesiosis in humans is uncommon; 
however, B. divergens infections are increasingly reported 
in immuno -compromized or splenectomized patients, often 
causing death (16). Babesia microti is another human para- 
site but will not be considered here since, despite its generic 
name and the presence of three ves-like genes in its genome 
(17), it is relatively distant from Babesia sensu stricto in phy- 
logenetic terms and offers limited value as an out-group in 
this study (15). 

Babesia spp. infections begin with invasion of host 
red blood cells and continue through cycles of intra- 
erythrocytic re-invasion. Secreted parasite proteins are 
present on the surface of infected red blood cells (IRBC). 
Variation in these epitopes first indicated that serological 
responses to B. bovis infections are isolate-specific and do 
not provide lasting immunity (18). Antigenic variation was 
directly inferred from successive waves of gain and loss of 
immuno-reactivity to the major expressed IRBC surface 
antigen, and changes in antigen size, over the course of a 
clonal infection (19). The proteins responsible for these re- 
sponses in B. bovis are the heterodimeric variant erythrocyte 
surface antigen (VESA), encoded by the multi-copy vesla 
and vesl ft gene families in B. bovis (20,21). vesl genes are ex- 
pressed from specific sites known as loci of active transcrip- 
tion (LAT) that typically contain a vesla gene and a veslfi 
gene in divergent orientation (22). vesl gene expression is 



regulated by bidirectional promoters within the LAT inter- 
genic region (23) and involves localized chromatin remod- 
eling (24). This precise regulatory mechanism is thought 
to ensure monoallelic expression, in which only the 'active' 
vesl gene within the LAT is expressed while all other loci 
are silenced; and recent analysis of vesl complementary de- 
oxyribonucleic acid (cDNA) in B. bovis C9.1 supported this 
view (25). 

The vesl gene repertoire of B. bovis has been established 
from its complete genome sequence (26). This family con- 
tains 72 vesla, 43 vesl ft and four atypical genes that do 
not conform to either structure; 66 vesl (55%) are arranged 
in divergent orientation, i.e. potential LATs. These loci are 
distributed throughout the chromosomes, and overall, 75% 
of vesl genes are found in the interstitial chromosome re- 
gions, rather than showing any positional bias towards sub- 
telomeres. LATs also tend to contain a third locus, the small 
open reading frame (smORF) genes (26), whose function is 
unknown. Vesla and veslfi genes have different gene archi- 
tectures; canonical vesla genes consist of three exons and 
two short introns; vesl ft genes vary widely in intron num- 
ber and length, with up to 11 introns (20,22,26). Due to 
the relative simplicity of the vesla gene structure and lack 
of structural variation between paralogs relative to vesl ft, 
it was suggested that vesla may have originated through 
the retro transcription of an incompletely spliced vesl/3 gene 
transcript (22). 

The ability of pathogens to employ antigenic variation in 
the long-term depends on the generation of novel variant 
antigens through evolutionary change. Genomic research in 
two models of antigenic variation, var genes in the malaria 
parasite Plasmodium falciparum (27-29) and vsg genes in 
the African trypanosome Trypanosoma brucei (30,31), have 
begun to explain how variant antigens change over evolu- 
tionary time, and how their dynamics differ from 'ordinary' 
genes (32). By taking a similar approach, we can examine 
how antigenic diversity is generated in Babesia. The evo- 
lution of variant antigen genes is rapid; gene duplication 
is more frequent and sequence divergence faster relative to 
the genomic background (27,33). Rapid change is due in 
part to frequent gene conversion, which is a strategy inde- 
pendently evolved by many pathogens to promote antigenic 
diversity (34). Phylogenetic comparisons of both var and 
vsg show that rapid change results in species-specific gene 
repertoires and a lack of orthology [i.e. genes from differ- 
ent species form mutually exclusive clades (31,35)], as well 
as structural differentiation of the repertoire for life stage or 
disease stage-specific expression (36,37), and the derivation 
of novel, invariant genes that appear to acquire new func- 
tions (30,31). 

Our understanding of vesl gene diversity and antigenic 
variation is largely based on the analysis of B. bovis. In 
this study, our aim is to characterize the variant antigen 
repertoires in multiple Babesia species using comparative 
genomics and, to identify how variant antigens originated, 
how repertoires in other Babesia spp. vary from the model 
based on B. bovis, and to examine the evolutionary pro- 
cesses that generate antigenic diversity on a population 
scale. We have sequenced the genomes of several strains of 
B. bigemina and B. divergens. Our results show that, while 
vesl genes are widespread, primary structure and genomic 
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repertoire varies substantially, such that features in B. bovis 
like heterodimerism and organization into LATs, may not 
occur in other species. This study reveals a history of con- 
stant innovation in Babesia genomes with respect to genes 
implicated in antigenic variation, and, in so doing, we have 
discovered that novel ves-like gene families, thought to be 
secreted, have been independently derived from canonical 
vesl genes in different species. The expression profiles and 
evolutionary dynamics of these novel ves-like genes differ 
from canonical vesl genes, suggesting that they are exposed 
to fundamentally different functional constraints. 

MATERIALS AND METHODS 

Parasite isolation 

The B. bovis C9.1 clonal line is a derivative of the Mex- 
ico isolate. Its derivation from the M07 clonal line has 
been described before (19). The B. bigemina BOND clone 
C7-1 was derived from an Australian isolate in 1991 by 
the Queensland Department of Agriculture, Fisheries and 
Forestry. It was cloned by limiting dilution in in vitro cul- 
ture (38) and then stored under liquid nitrogen until 2004 
when it was removed and inoculated into a splenectomized 
Bos taurus calf. Parasites were first detected five days af- 
ter inoculation of the calf and 2 1 of peripheral blood was 
collected after a further two days. Parasites were purified 
from erythrocytes and white blood cells using a saponin ly- 
sis method (39) and then DNA was extracted using phenol- 
chloroform. The B. bigemina JG29 and Puerto Rico (PR) 
isolates were kind gifts from T.F. McElwain and were main- 
tained in in vitro culture for isolation of nucleic acids and 
proteins. The B. bigemina BbiS3P strain was isolated from 
an 1 8 months old Braford heifer, chronically infected, in the 
Salta province, Argentina, in March 2005. The isolation was 
achieved in vitro, by cultivation of heifer's erythrocytes pre- 
viously washed with buffer VYM. The B. diver gens 1802A 
strain was isolated in May 1988 from a cow in the Le Cher 
department of France with acute babesiosis. The B. diver- 
gens strain Rouenl987 (40) was a kind gift of H. Vial, Mont- 
pellier. 

Cell culture 

The B. bovis C9.1 clonal line, and B. bigemina JG29 and 
PR isolate parasites were cultivated in vitro under mi- 
croaerophilous stationary phase conditions as described 
(41), with slight modifications (42). Closed herd bovine 
blood donors maintained at the University of Florida were 
used in all work with these strains. All animal procedures 
were approved by the University of Florida Institutional 
Animal Care and Use Committee. The B. bigemina BbiS3P 
strain was also cultivated in vitro and stored frozen in liq- 
uid nitrogen. A culture expansion to achieve >50% of para- 
sitized erythrocytes was accomplished to purify merozoites 
for gDNA extraction. The B. divergens 1802A strain was ob- 
tained by infecting a single gerbil with the cow blood. In 
vitro culture was initiated from this isolate. After initiation, 
we cloned the culture by the limiting dilution technique to 
avoid producing DNA from a mixed parasite population. 
One clone was selected and expanded in vitro immediately. 
The B. divergens Rouen 1987 was cultivated in human red 



blood cells, blood group A+ in 10% human serum of blood 
group A+. Cultivation was in a 5% O2, 5% CO2 atmosphere. 

Sample preparation 

The B. bovis C9.1 and B. bigemina JG29 and PR parasites 
were grown in erythrocytes depleted of contaminating DNA 
and bovine white blood cells by use of Whatman CF-1 1 cel- 
lulose (43). B. bigemina PR-infected erythrocytes were en- 
riched to near homogeneity using Percoll gradients (44). 
Genomic DNA was isolated as described (45) with spool- 
ing at the final step. For protein analysis, cells were washed 
two times with phosphate-buffered saline, the supernatants 
were removed and the cells were flash-frozen on dry ice. B. 
bigemina BbiS3P free merozoites were isolated from stroma 
by using Percoll gradients. Merozoite parasites were pel- 
leted and pellets were lysed at 58°C for 1 h in lysis buffer 
(0.05 M Tris-HCl pH 8.0, 0.1 M EDTA, 0.1 M NaCl, 2% 
SDS) with 160 |xg of proteinase K (Invitrogen®). gDNA 
was extracted with 1 vol of phenol/chloroform/isoamyl al- 
cohol (Invitrogen®), precipitated with ice-cold isopropyl al- 
cohol and washed once with 75% ice-cold ethanol. Pellets 
were suspended in 50 |xl distilled water and were kept at 
— 20° C until use. Genomic DNA was prepared for B. diver- 
gens 1802A and Rouen 1987 from cell lysates of monoclonal 
cultures using commercial kits (Promega and Qiagen, re- 
spectively). 

Genome sequencing and annotation 

The B. bigemina BOND genome was sequenced using capil- 
lary technology to 8X coverage and assembled with Phrap. 
The genome was manually improved in Gap4. Sequenc- 
ing errors were corrected using a polymerase chain reaction 
(PCR) free Illumina library and the iCORN algorithm (46). 
The B. divergens 1802A strain was sequenced using the 454 
Titanium platform combining reads from a genomic frag- 
ment paired-end library with 3 and 8 kb mate-pair libraries 
and assembled using Newbler v2.6. Sequencing errors were 
corrected using additional SOLiD v3 reads. Protein-coding 
genome annotation for B. bigemina BOND and B. divergens 
1802A were generated by taking a consensus of several ab 
initio gene prediction programs. Annotation from B. bovis 
(26) was used to train Augustus (47), glimmerHMM (48) 
and SNAP (49). Each of these trained gene finders were 
then run on the target genomes and this annotation was 
combined in each case using Jigsaw (50). The annotation 
was improved manually. Three further B. bigemina genomes 
(PR, BbiS3P and JG29), the B. bovis C9.1 genome and the 
B. divergens Rouen 1987 genome were sequenced using the 
Illumina GA platform. The libraries were prepared with the 
PCR-free protocol and sequenced to 76 read pairs with an 
insert size of around 350 bp. Illumina reads were assembled 
with velvet (51). The genomes were improved using the Post 
Assembly Genome Improvement Toolkit (PAGIT) pipeline 
(52), applying the contig ordering against the BOND refer- 
ence using the program ABACUS (53) and the gap-filling 
step (54). For the B. bigemina genome a further gap-filling 
was performed, using gap-filler (55). Contigs smaller than 1 
kb were ignored. Two methods were used to annotate these 
draft genome sequences and the outcome merged. First, an- 
notation was transferred from the reference genomes (i.e. 
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B. bovis T2Bo, B. diver gens BOND and B. diver gens 1802A) 
to additional strains using the Rapid Annotation Transfer 
Tool (RATT) (56). Genes models with errors were excluded. 
Second, Augustus (47) was trained with the complete refer- 
ence gene set to perform ab initio annotation. 

Transcriptomic analysis 

Strand- specific messenger ribonucleic acid (mRNA) se- 
quencing libraries were prepared from total RNA of B. 
bigemina PR strain (two biological replicates) using TruSeq 
stranded mRNA Sample Prep Kit LT (Illumina) according 
to manufacturer's instructions. Briefly, polyA+ mRNA was 
purified from total RNA using oligo-dT dynabead selec- 
tion. First strand cDNA was synthesized using randomly 
primed oligos followed by second strand synthesis where 
dUTPs were incorporated to achieve strand-specificity. The 
cDNA was adapter-ligated and the libraries amplified by 
PCR. Libraries were sequenced in Illumina Miseq with 
paired-end 150 bp read chemistry. Strand-specific RNA-seq 
paired-end reads were mapped onto the B. bigemina PR 
genome with TopHat2 (57) with options '-library-type = 
fr-firststranded' and -no-novel-juncs'. Genes in B. bigem- 
ina PR orthologous to B. bigemina BOND ves genes were 
identified from B. bigemina PR gene annotations and their 
corresponding normalized transcript abundances (FPKM 
values) were quantified from aligned reads using Cuffdiff2 
(58). In case of multiple orthologs in PR for a particular 
BOND ves gene, the orthology with highest FPKM was se- 
lected. 

Sample preparation for proteomics 

Protein from B. bigemina PR cell lysates was dispensed into 
low protein-binding microcentrifuge tubes (Sarstedt, Le- 
icester, UK) and made up to 1 60 |xl by addition of 25 mM 
ammonium bicarbonate. The proteins were denatured us- 
ing 10 |il of 1% (w/v) RapiGest™ (Waters MS Technolo- 
gies, Manchester, UK) in 25 mm ammonium bicarbonate 
followed by three cycles of freeze-thaw and two cycles of 10 
min sonication in water bath. Sample was then incubated 
at 80°C for 10 min and reduced with 3 mM dithiothreitol 
(Sigma-Aldrich, Dorset, UK) at 60°C for 10 min then alky- 
lated with 9 mM iodoacetamide (Sigma-Aldrich, Dorset, 
UK) at room temperature for 30 min in the dark. Proteomic 
grade trypsin (Sigma-Aldrich, Dorset, UK) was added at 
a protein: trypsin ratio of 50:1 and samples incubated at 
37°C overnight. In a parallel experiment aiming to maxi- 
mize protein separation, protein samples were also solubi- 
lized and resolved on a NuPAGE® Novex® 4-12% Bis-Tris 
Gel (Life Technologies Ltd., Paisley, UK) following manu- 
facture's instruction. Four gel slices were excised and each 
digested with trypsin (Sigma-Aldrich, Dorset, UK). 

Mass spectrometry 

Peptide mixtures from both in solution digestion and ID- 
sodium dodecyl sulphate-polyacrylamide gel electrophore- 
sis (SDS-PAGE) were analysed by on-line nanoflow liq- 
uid chromatography using the nanoACQUITY-nLC sys- 
tem (Waters MS Technologies, Manchester, UK) coupled 



to an LTQ-Orbitrap Velos (ThermoFisher Scientific, Bre- 
men, Germany) mass spectrometer equipped with the man- 
ufacturer's nanospray ion source. The analytical column 
(nanoACQUITY UPLCTM BEH130 C18 15 cm x 75 |xm, 
1.7 |xm capillary column) was maintained at 35°C and a 
flow-rate of 300 nl/min. The gradient consisted of 3-40% 
acetonitrile in 0.1% formic acid for 90 min then a ramp of 
40-85% acetonitrile in 0.1% formic acid for 3 min. Full scan 
MS spectra (m/z range 300-2000) were acquired by the Or- 
bitrap at a resolution of 30 000. Analysis was performed in 
data-dependent mode. The top 20 most intense ions from 
MSI scan (full MS) were selected for tandem MS by colli- 
sion induced dissociation and all product spectra were ac- 
quired in the LTQ ion trap. Ion trap and Orbitrap maximal 
injection times were set to 50 and 500 ms, respectively. 

Proteomic analysis 

Thermo RAW files were imported into Progenesis LC-MS 
(version 4.1, Nonlinear Dynamics, UK). Runs were time 
aligned using default settings and using an auto selected 
run as reference. Peaks were picked by the software and fil- 
tered to include only peaks with a charge state of between 
+2 and +6. Peptide intensities were normalized against the 
reference run by Progenesis LC-MS and these intensities 
are used to highlight differences in protein expression be- 
tween control and treated samples with supporting statisti- 
cal analysis (ANOVA and ^-values) calculated by the Proge- 
nesis LC-MS software. Spectral data were transformed to 
mgf files with Progenesis LC-MS and exported for peptide 
identification using the Mascot (version 2.3.02, Matrix Sci- 
ence) search engine. Tandem MS data were searched against 
the predicted protein set of the B. bigemina BOND refer- 
ence genome sequence. Search parameters were as follows: 
precursor mass tolerance set to 10 ppm and fragment mass 
tolerance set to 0.8 Da. One missed tryptic cleavage was per- 
mitted. Carbamidomethylation (cysteine) was set as a fixed 
modification and oxidation (methionine) set as a variable 
modification. Mascot search results were further processed 
using the machine learning algorithm Percolator. The false 
discovery rate was <1%. Individual ion scores >13 indi- 
cated identity or extensive homology (P < 0.05). Results 
were imported into Progenesis LC-MS as .xml files. At least 
two unique peptides were required for reporting proteins 
that were differentially expressed. Results from in solution 
digestion and gel slices of 1D-SDS PAGE are grouped using 
'Combine analysed fractions' function in Progenesis LC- 
MS where statistical analysis are updated. 

Annotation of v^s-like genes 

We searched for ves-like and smORF-like genes among B. 
bigemina and B. divergens translated open reading frames 
using hidden Markov models (HMM) built using HMMER 
v3.0 (http://hmmer.janelia.org/) from sequence alignments 
of BbovVesla, BbovVeslfi and BbovSmORF. Once this had 
revealed native vesl sequences in B. bigemina and B. diver- 
gens (though not smORF), these sequences were used in 
turn to create new hidden Markov models (HMMs) for a 
second search, which identified ves2 in each case. Lastly, 
tBLASTx and BLASTn were applied to the B. bigemina 
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and B. divergens genome sequences using native vesl or ves2 
sequences to identify any unannotated copies, vesl struc- 
tures are highly mutable, and therefore it is inadvisable to 
simply rely on annotation transfer and sequence homology 
with known genes in B. bovis, which may not adequately 
capture ves-like genes in other genomes. Coding sequences 
for the ves-like genes analysed in this study have been de- 
posited on the Wellcome Trust Sanger Institute FTP site 
(ftp://ftp.sanger.ac.uk/pub/pathogens/Babesia/). 

Comparative genomics 

We used OrthoMCL v2.0.9 (59,60) to examine species- 
specific genes and gene families with species disparities in 
copy number. Protein sets from B. bovis T2Bo, B. bigem- 
ina BOND and B. divergens 1802A were clustered using Or- 
thoMCL, set to maximize cluster size (i.e. minimized se- 
quence identity requirements for clustering). From these 
data we calculated the number of genes shared by all three 
species in single copy ('conserved') or as part of gene fami- 
lies with variable copy number ('semi-conserved'). We also 
extracted the number of vesl homologs and vesl-Xiks short- 
ened forms (i.e. vesl) in each genome. Finally, we calcu- 
lated the number of species-specific genes in single or multi- 
ple copies. In comparisons of chromosomal rearrangements 
across species, and in comparisons of conserved vesl/ves2 
loci across strains, we used the Artemis Comparison Tool 
[ACT (61)] to visualize conservation in gene order. 

Phylogenetic analysis 

Translated nucleotide sequences for each sub-family were 
aligned in ClustalW (62) and then manually edited in 
BioEdit v7.1.3. (63). A phylogeny of all ve^-like genes 
was estimated by selecting representative samples of 
BbovVesla/l(3/l7, BbigVesla/lb and BdivVesl sequences 
and creating an 840 character DNA sequence alignment 
of the conserved C-terminal domain (corresponding to the 
transmembrane and cytoplasmic domains). The phylogeny 
was estimated using a GTR+G model in PHYML (64) with 
100 non-parametric bootstraps and in MrBayes (65,66) un- 
der these settings: Nruns = 4, Ngen = 5 000 000, sample- 
freq = 500 and default prior distribution. Phylogenies of all 
sequences were estimated for each sub-family (combining 
reference and strain sequences) using a GTR+G model in 
PHYML. Bootstrap proportions were attempted but these 
were low, as is typical for large alignments of such hypervari- 
able genes. Phylogenies of reference and strain sequences 
at conserved positions were estimated using the same ap- 
proach for each species, for use in co-phylogenetic analyses 
(see below). 

Co-phylogenetic analysis 

Phylogenetic reconciliation is a method for resolving topo- 
logical disparities between two phylogenetic trees that are 
expected to have the same topology and was initially devel- 
oped for the comparison of parasite phylogenies with that 
of their hosts, and of gene family trees with species phy- 
logenies (67,68). Disparities are resolved by positing evo- 
lutionary events from a general model of host-parasite or 



gene family coevolution (i.e. codivergence, duplication, loss 
and host switching). The statistical significance of topo- 
logical congruence between two associated trees is typi- 
cally assessed using permutation tests. Taking each refer- 
ence genome and one other strain, we identified all ves-like 
loci that were conserved in position and contained a se- 
quence in both genomes. After extracting these sequences, 
we estimated two phylogenies for each ves-like sub-family, 
one containing 'reference' sequences and another 'strain' se- 
quences. Since these genes occurred in conserved positions 
we can assume that they evolved before the separation of 
reference and strain genomes, and therefore the two phylo- 
genies should look exactly the same in the absence of any 
transposition between loci after separation (i.e. recombina- 
tion). The significance of phylogenetic congruence was as- 
sessed for each sub-family by permuting the strain tree in 
Jane 4 (69). 

Recombination analysis 

Phylogenetic incompatibility describes the presence of mul- 
tiple phylogenetic signals within a single sequence align- 
ment and is the historical signature of recombination. The 
pair-wise homoplasy index (PHI) detects incompatibility 
between sites and is robust in the presence of rate hetero- 
geneity (70), which might otherwise simulate the effects of 
recombination. P < 0.05 for PHI indicates significant in- 
compatibility between sites within an alignment, consis- 
tent with recombination. For each ves-like gene family, PHI 
was calculated for 1000 sequence quartets selected at ran- 
dom from multiple alignments of full-length nucleotide se- 
quences. Sequence triplets were attempted but failed to pro- 
duce sufficient genetic variation for a viable test in most 
cases. The proportion of quartet alignments showing sig- 
nificant phylogenetic incompatibility (P pi ) was calculated. 
A second method for detecting sequence mosaicism was im- 
plemented using 3seq (71). 3seq carries out exhaustive com- 
parisons of sequence triplets in a multiple alignment of nu- 
cleotide sequences to identify mosaics and returns a P- value 
for each triplet adjusted by a Dunn-Sidak correction for the 
multiple comparisons made in that run. P pi was again cal- 
culated from the proportion of sequence triplets with P < 
0.05. 

Data accessibility 

Sequence read data have been submitted to the European 
Nucleotide Archive (http://www.ebi.ac.uk/ena) with the ac- 
cession numbers ERP000167 and ERP000252. BioPro- 
ject identifiers (http://www.ncbi.nlm.nih.gov/bioproject/) 
for the B. bigemina BOND and B. divergens 1802A ref- 
erence genomes are PRJEB5046 and PRJNA230984, re- 
spectively. Genome sequences for B. bigemina strains 
JG29, BbiS3P and PR, B. bovis C9.1 and B. divergens 
Rouen 1987 are available from the Wellcome Trust Sanger 
Institute FTP site (ftp://ftp.sanger.ac.uk/pub/pathogens/ 
Babesia/). Mass spectrometry and proteomic data have 
been deposited in the ProteomeXchange Consortium (http: 
//proteomecentral.proteomexchange.org) via the PRIDE 
partner repository (http://www.ebi.ac.uk/pride/) with the 
dataset identifier PXD000629. 
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RESULTS 

Relative structure and content of Babesia genome sequences 

For a comparative analysis of the ves gene repertoire in 
Babesia, we have produced high-quality, draft genome se- 
quences for an additional strain of Babesia bovis (C9.1), 
four strains of B. bigemina (BOND, PR, BbiS3P and JG29) 
and two strains of B. divergens (1802A and Rouenl987). De- 
scriptions of these strains and genome sequences are given 
in Table 1 . Variation in genome size, gene number and the 
proportion of coding sequences are related to the quality of 
sequence assemblies and sequence contiguity, which is lower 
in sequences produced from short reads only The larger 
number of coding sequences in B. bigemina relative to other 
species is due to unique gene duplications of conserved 
gene families and to a greater number of species-specific se- 
quences encoding hypothetical proteins. Nonetheless, clus- 
tering analysis of B. bovis T2Bo coding sequences combined 
with corresponding data from B. bigemina BOND and B. 
divergens 1802A shows that gene content is consistent be- 
tween species and that variation in surface antigens occurs 
against a largely conserved genomic background. Figure 1 
shows that coding sequences seen in all species represent 78- 
88% of genes; 68-81% of all coding sequences show one- 
to-one correspondence (i.e. perfect orthology). Therefore, 
species-specific genes represent between 12.2 and 21.5% of 
genes in these species, with B. bigemina displaying the high- 
est proportion of unique features. These are maximum es- 
timates since they include predicted protein sequences that 
failed to cluster, and so potential mis-annotated sequences 
could be designated as species-specific genes. Ves gene ho- 
mologs comprise a large proportion of these species-specific 
genes: 4.7% of all coding sequences in B. bovis T2Bo, 8% in 
B. bigemina BOND and 1 1.2% in B. divergens 1802A. While 
the remaining unique sequences are not homologous to 
vesl, they are often predicted to be expressed at the cell sur- 
face. This prediction is based upon their overall structural 
similarities with the B. bovis VESA1 polypeptides, which 
are placed on the erythrocyte surface (42), including a well- 
conserved C-terminal end and predicted transmembrane 
domain. This indicates that the most dynamic features of 
these genomes are associated with the host-pathogen inter- 
face, nullnull 

F^s-like loci coincide with chromosomal breakpoints 

By mapping sequence scaffolds to the B. bovis genome, 
we determined that B. bigemina BOND has four chromo- 
somes, as suggested previously (70); however, these are not 
co-linear with B. bovis chromosomes. Sequence alignment 
based on conserved gene order demonstrates that segmen- 
tal inversion and chromosomal rearrangement have been 
very frequent since these Babesia species separated. Sup- 
plementary Figure S1A/B aligns various sequence blocks 
from the B. bovis T2Bo and B. divergens 1802A genomes 
that correspond to full-length chromosomes in B. bigemina 
BOND. The positions of coding sequences homologous to 
B. bovis vesl genes are marked and these coincide with chro- 
mosomal breakpoints. When we inspect breakpoints in co- 
linearity more closely, as shown in Figure 2, ves-like genes 
occur at homologous positions in all three genomes, but it is 



clear that these homologs have quite different size and ori- 
entation (Figure 2a). Other multi-copy genes (shaded yel- 
low) also congregate at similar positions (Figure 2b), remi- 
niscent of the proximity of smORF to vesl in B. bovis T2Bo 
(22). In addition, and somewhat anomalously, B. bigemina 
BOND has 28 ves lb loci on chromosome 3 at positions not 
associated with genome rearrangements, which likely repre- 
sent lineage-specific insertion events (Figure 2c). The pres- 
ence of ves-like genes at corresponding positions in differ- 
ent Babesia species may be ancestral, (i.e. these genes are 
orthologs, descended directly from a progenitor at the same 
position in the ancestral genome), or it may be mechanis- 
tic, (i.e. these positions are prone to rearrangement in all 
species and ves-like genes, which are not orthologous, inde- 
pendently transpose to them as regions in which purifying 
selection is weak). 

F^s-like sequences diverge rapidly after speciation 

Vesl genes are among the fastest-evolving genes in Babesia 
genomes; in comparisons between species, they display 
about 25% amino acid identity and 75% of vesl genes oc- 
cur in the top 25% of all genes when ranked by sequence 
divergence, which accounts for the skew towards low iden- 
tity in frequency distributions of sequence divergence across 
the whole genome (see Supplementary Figure S2). Rapid di- 
vergence of VESA proteins is also evident in comparisons 
of genomic repertoire. The B. bovis T2Bo genome contains 
two families of vesl genes {vesl a and veslfi), as well as the 
smORF gene family that are found interspersed among vesl 
clusters. Through comparison with other genome sequences 
using BLAST and HMMER, we have identified ves-like 
genes in both B. bigemina BOND and B. divergens 1802A, 
although with important differences, as shown in Figure 3. 
Hereafter, we will prefix gene names with species labels, e.g. 
BbovVesla, BbigVesla, BdivVes2a, etc. While the relatively 
conserved C-terminal domain unambiguously confirms the 
homology of all vesl predicted proteins, these proteins are 
highly divergent in their remaining primary structures and 
no B. bigemina or B. divergens sequences contained extra- 
cellular 'cysteine- and lysine-rich domain' (CKRD) or the 
'variant domain conserved sequences' (VDCS) domain pre- 
viously characterized in B. bovis VESA (20). 

In B. bigemina BOND there are two families homologous 
to vesl that are approximately equally abundant; we refer 
to these families as BbigVesla (N = 74) and BbigVeslb (N 
= 80). While ~70% of B. bovis vesl are arranged in puta- 
tive LATs consisting of both vesl types (26), these homologs 
are similarly arranged in only 35% of cases in B. bigemina 
BOND. Indeed, the phylogeny of all vesl genes (see Figure 
4 below) indicates that these homologs in B. bigemina are 
not orthologous to the vesl a and veslfi of B. bovis. Sixteen 
gene copies (' BbigVeslba') appear to be recombinant veslb, 
with vesla-typQ 3' ends. We found no smORF homologs in 
B. bigemina. 

In addition to BbigVesla and BbigVeslb, there is a third 
gene family in B. bigemina that we refer to as BbigVes2 (N 
= 116). These genes were identified due to their homol- 
ogy with the 5 / -most 500 bp of BbigVeslb and they en- 
code VESA-like proteins lacking the C-terminal domain of 
the canonical VESA1 or any GPI-anchor signal. These pre- 
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Figure 1. Pie charts showing the classification of predicted coding sequences in three Babesia genomes, based on three-way OrthoMCL analysis. Genes 
with a 1:1:1 distribution are termed 'conserved'. Genes present in all three species with variable copy number are called 'semi-conserved', vesl genes in 
Babesia bovis and full-length homologs in other species are represented in yellow. SmORF 'm B. bovis and ves-Mke, short genes (ves2) in other species are 
represented in orange. The remaining species-specific genes (either single or multi-copy) are represented by green, blue and purple for Babesia bigemina, B. 
bovis and Babesia divergens respectively. 

Table 1. Properties of seven Babesia genome sequences produced in this study 
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length 
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density 
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(Mbp) 
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coverage 


genes 


Coding 


(bp) 


introns 


(bp) 


B. bovis 


Mexico 


Cow 


7.61 


42 


46 


2.05 


591 


3726 


70.5 


1501 


62.53 


2154 


C9.1 


























B. bigemina 


Argentina 


Cow 


13.84 


51 


6 


3.52 


8 


4457 


66.3 


1531 


58.44 


2306 


BOND 
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Puerto 
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12.68 


50 


320 


2.46 


362 
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NA 


1812 
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ND 


PR 
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Argentina 


Cow 


12.94 


50 


533 


3.0 


218 


4948 


NA 


1805 
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ND 
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B. bigemina 


Mexico 


Cow 


15.9 


50 


1299 


2.11 


262 


5689 


NA 


1752 


54.23 


ND 


JG29 


























B. divergens 


France 


Cow 


9.58 


42 


81 


1.12 


43 


4134 


64.0 


1487 


59.02 


2321 


1802 A 


























B. divergens 


France 




8.97 


46 


482 


NA* 


736 


4097 


NA 


1439 


57.51 


ND 


Rouenl987 




Human 























Note. All statistics refer to contigs greater than 1 kb in size. Due to the number of sequencing gaps, entries marked 'ND' could not be calculated. Entries 
marked 'NA are omitted because the contigs were ordered against an arbitrary union file of all contigs. 



dieted proteins do not possess a PEXEL motif, which is re- 
quired by some proteins in Plasmodium for secretion (73), 
but they do have a predicted signal peptide and, without any 
obvious means of membrane attachment, we suggest that 
they are secreted. ves2 genes are located in similar positions 
to vesl in the B. bigemina genome (see Supplementary Fig- 



ure SI). Comparison of the homologous region shared by 
veslb and ves2 (not shown) shows that all ves2 gene copies 
share a common ancestor and are structurally distinct from 
all veslb copies, demonstrating the ves2 are not pseudogenic 
fragments of vesl. 
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Figure 2. Comparison of gene order at three regions of chromosomal rearrangement. Forward and reverse strand are represented by horizontal bars, 
colour-coded by species (purple: Babesia divergens 1802A; green: Babesia bigemina BOND; blue: Babesia bovis T2Bo). Genes are indicated by boxes within 
reading frames, ves-like gene models are colour-coded as indicated by the key. Vertical grey bars between genomes represent significant BLASTn hits 
as calculated in ACT. (A) The region spanning 1 154— 1238kb of chromosome 1 in B. bigemina, which corresponds to chromosomal breakpoints in both 
other species. (B) The region spanning 302-364 kb of chromosome 2 in B. bigemina, which corresponds to a chromosomal breakpoint in B. divergens. 
(C) The region spanning 975-1335 kb of chromosome 3 in B. bigemina that is conserved in both other species but which has experienced numerous B. 
bigemina-spQcific insertions of BbigVeslb genes (shaded blue). The genomic locations of regions a-c are shown in Supplementary Figure S1A/B. 



In B. divergens 1802A, we observe a single gene family ho- 
mologous to vesl (N = 202) that encode predicted proteins 
with a canonical VESA C-terminal domains for membrane 
attachment. This family includes two forms shown in Figure 
3, one short and another longer due to a cysteine-rich, low 
complexity region towards the N-terminus that is absent 
from the short forms; otherwise these two forms are closely 
related and do not form monophyletic groups in phyloge- 
netic trees (i.e. long and short forms are paraphyletic; see 
Figure 4). These genes are arranged in divergent orienta- 
tion, suggestive of an LAT, in only three instances, although 
the abundance of gaps in the B. divergens assembly substan- 
tially limits our ability to quantify genomic context. The 
impression is, however, that ves-like genes in B. divergens 
are typically arranged in tandem at sub-telomeric loci, but 
rarely on opposing strands like the LAT. As with B. bigem- 
ina, we found no smORF homologs in B. divergens but we 
did identify ves-like genes that are homologous to the 5' 



region of vesl, encoding predicted proteins without trans- 
membrane attachment. In B. divergens these genes separate 
into two families [referred to as BdivVes2a (N = 95) and 
BdivVes2b (N = 62)], both of which are reciprocally mono- 
phyletic with respect to vesl, and therefore represent distinct 
lineages rather than partial vesl pseudogenes (see Figure 4). 

With the discovery of ves2 in the B. bigemina and B. di- 
vergens genomes, it is intuitive to suggest that ves2 are anal- 
ogous to smORF in B. bovis. There is no compelling ev- 
idence for homology between smORF and ves-like genes, 
and so it seems most likely that smORF has evolved in B. 
bovis independently to perform a role analogous to ves2. 
We chose to include smORF in our subsequent analyses 
because, as a multi-copy family of secreted proteins, show- 
ing clonal variation (74) and spatial association with vesl, 
smORF clearly share similar circumstances to BbigVes2 
and BdivVes2. We checked the B. bovis genomes for gen- 
uine orthologs to ves2 and discovered a tandem gene 
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N 
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74 (48) 

43 (23) 

3(3) 

2(2) 

43 (39) 



B. bigemina BOND 

vesla 



veslb 

veslba 
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B. divergens 1802 A 

vesla 

veslb [HJL 
ves2a 



} 



ves2b 



(26) 

228 

(50) 
95 (87) 
62 (56) 



C-terminal domain 



Low complexity region 



Mean length ± SD 
(amino acids) 

1317.1 1 37.4 
1101.0 ±77.6 
906.0 ± 13.0 

273.5 ±16.3 

191.6 ± 51.1 



75 (38) 1786.6 ± 335.1 

87(43) 1202.6 ±33.7 

16(9) 3025.0 ± 163.6 

116(113) 250.8 ±10.3 



746.6 ±25.1 
1077.4 ±219.8 
442.0 ± 23.0 

303.7 ± 106.2 



Figure 3. Ves gene repertoire in Babesia genome sequences. Gene models are drawn to scale (average lengths are shown) and are represented by shaded 
boxes (exons) and lines (introns). The presence of low complexity regions (typically repetitive and with variable length) and the conserved C-terminal 
domain (containing a single transmembrane helix) are indicated. 



pair (BBOVJII002570/2580), which encode homologs to 
BbigVes2 in both B. bovis strains. Hence, these genes may 
be the sole remaining representatives of a BbovVes2 family 
now largely lost, perhaps replaced by smORF. 

Vesl genes share a 3' region that encodes the conserved 
VESA C-terminus. This is the only region for which a mul- 
tiple sequence alignment can be made across all species. 
A maximum likelihood phylogeny was estimated from the 
840-character sequence alignment using PHYML (see Fig- 
ure 4a). The phylogeny indicates that, while the ancestor of 
these Babesia species possessed a vesl gene family, the di- 
morphism evident in B. bovis (Bbov Vesla I If) and B. bigem- 
ina (BbigVesla/lb) is not ancestral and has independent 
origins in these lineages. Given the large genetic distances 
between these paralogous families within the same species, 
this indicates very substantial structural innovation post- 
speciation on multiple occasions. If we compare the branch 
lengths between Bbov Vesla/ 1/3 (0.5 substitutions per site) 
and BbigVesla/lb (1.5), the paralogs in B. bovis appear to 
have diverged much less. This is consistent with the hypoth- 
esis that Bbov Vesl a evolved from Bbov Vesl /3 after the ori- 
gin of B. bovis through reverse transcription, i.e. integration 



of an incompletely spliced Bbov Vesl transcript, which was 
originally proposed to account for the lack of introns in 
Bbov Vesl a (22) . Furthermore, we have identified a third 
vesl sequence type in the B. bovis T2Bo genome not previ- 
ously reported (BbovVesly; N = 3). Figure 4a shows that 
the branch length between BbovVesly and Bbov Vesla /I 
(1.5) is consistent with the distance between paralogous 
families in B. bigemina (1.46), suggesting that BbovVesla 
may have substituted BbovVesly as one half of the het- 
erodimer. 

The relationships of ves2 are not shown in Figure 4a 
because ves2 lack the conserved C-terminal region of the 
canonical forms. Therefore, we generated a network based 
on pair- wise FASTA scores (Figure 4b), exhaustively com- 
paring full-length VESA1 and VESA2 sequences from all 
species, to examine the suggestion that ves2 genes indepen- 
dently evolved from ancestral vesl sequences. Canonical 
VESA1 sequences cluster together at the centre of the net- 
work, while protein sequences corresponding to BbigVes2 
and BdivVes2a/2b share no connections. This is consis- 
tent with vesl in all species having a single common ori- 
gin, as implied in Figure 4a, while ves2 have separate ori- 
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Figure 4. (A) Unrooted maximum likelihood phylogeny of vesl genes from Babesia spp. based on a multiple nucleotide sequence alignment corresponding 
to the conserved C-terminal domain of VESA1 only (840 characters). A GTR+r model was applied. Support for principal nodes is indicated by non- 
parametric bootstraps and posterior probabilities from a Bayesian analysis using the same model. (B) Sequence similarity network based on FASTA scores 
generated from pair-wise comparisons of VESA1 and VESA2 amino acid sequences and generated using BioLayout Express v3.0. Individual sequences 
are represented by spheres, shaded by gene family, connected by lines that represent sequence homology. The network was organized such that spheres 
edge length is minimized and spheres are positioned nearest to their closest relatives. A lower threshold has been applied to exclude poor sequence matches, 
leaving only the strongest similarities as determined by FASTA. SMORF sequences were included, but no FASTA scores exceeded the threshold. Ves-Yikz 
gene families are labelled as described in the text; a single Babesia bovis sequence that clusters close to BbigVes2 (BBOV_III002580) is shown with a red 
circle. 
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gins. BdivVes2a/2b unambiguously cluster most closely to 
BbigVesla, suggesting that they are derived from an an- 
cestor of BbigVesla now lost or unrecognizable in B. di- 
vergens. BbigVes2 clusters most closely to BbigVeslb and 
BbovVesla, suggesting that it was derived from an ances- 
tor of these two gene families. The rare Bbov Ves2 sequences 
introduced above are positioned intermediately between 
BbigVes2 and BbovVeslfi in Figure 4b (circled in red), sug- 
gesting that may share the same origin as BbigVes2. There- 
fore, given that a vesl structure is present in all species (and, 
indeed, in B. microti (17)), but a common ves2 structure is 
not widespread, we may infer that ves2 in B. bigemina and B. 
divergens are not orthologous, and that they have been de- 
rived from vesl (and not vice versa) on separate occasions. 

Multiple VESA2 proteins may be expressed with greater 
abundance than VESA1 

Genome comparison indicates that ves-like genes lacking 
the conserved 3' region have been independently derived 
from the canonical vesl genes during Babesia evolution. Se- 
quence comparison indicates that these ves2 genes are not 
non-functional vesl pseudogenes but, in fact, encode highly 
abundant and polymorphic proteins. To explore the biolog- 
ical differences between vesl and ves2, we carried out pro- 
teomic analyses of global gene expression in B. bigemina 
PR. We identified 1777 peptides corresponding to 366 pre- 
dicted proteins in the B. bigemina BOND reference genome 
(i.e. 8.2% of the predicted proteome; see Supplementary 
Table SI). Figure 5 shows the abundance of VESA1 and 
VESA2 protein relative to each other, and to all other pro- 
teins identified in our data set. Most importantly, it shows 
that of the 1 5 different VES A2 proteins that are expressed, 
seven are more abundant than the most abundant VESA1 
protein; in general VESA2 are significantly more abun- 
dant than VESA1 (P < 0.001; randomization test). The 
distribution of abundance between proteins is also subtly 
different. We observe two co-dominant VESA IB proteins 
(BBBOND.0303890 and BBBONDD305310) and 17 other 
copies expressed at a low level (i.e. between 0.08 and 17.7% 
of the abundance of the dominant protein). This observa- 
tion could be consistent with either the biallelic expression 
of a mixed VESA IB homodimer (we did not detect expres- 
sion of VESA 1 A) or monoallelic expression of VESA lb 
monomers in the mixed population. The results also suggest 
that either multiple VESA2 are expressed simultaneously or 
there is rapid switching among ves2 genes. B. bigemina PR is 
not a clonal line, and our data cannot distinguish these pos- 
sibilities. In contrast to VESA1, the most abundant VESA2 
protein (BBBOND_0302650) showed much less conspicu- 
ous dominance over other expressed VESA2 (see Figure 
5). Hence, VESA2 expression may be less regulated than 
VESA1, although what we observed constitutes only 23% 
of the ves2 repertoire in B. bigemina PR. These results con- 
firm that multiple ves2 genes are competent to be expressed 
and are not simply degenerate vesl gene copies. 

The apparent lack of VESA 1 A expression is puzzling; it 
is unlikely to be due to technical insensitivity since we are 
able to detect a variety of VESA IB and VESA2 peptides 
at even very low abundance. To check this result, we car- 
ried out RNA-seq analysis of ves-like transcripts and this 



corroborated the proteomic data in showing little evidence 
for vesl a expression (Supplementary Figure S3). It is possi- 
ble that, if expressed individually and monoallelically, there 
may be in vitro selection over many generations against par- 
asites expressing VESA1A, e.g. by adhesiveness towards 
polystyrene, removing these parasites from the population. 
This observation deserves further investigation. 

Transposition of ves genes is evident from intraspecies varia- 
tion 

It is apparent from the species comparisons above that vesl 
and ves2 genes are highly labile in terms of their molecu- 
lar sequences and their genomic repertoire. To achieve such 
substantial divergence, we must predict that ves genes are 
frequently transposed to new genomic positions over eco- 
logical or individual timescales. To observe transposition 
directly we sequenced the genomes of additional strains of 
B. bovis, B. bigemina and B. divergens from diverse locations 
(see Table 1) and determined the conservation of ves loci 
at the intraspecific level. Table 2 describes the considerable 
variation within species in the presence and absence of ves 
loci. For example, between 65.3 and 69.4% of BbigVesla loci 
and between 78.2 and 80.8% of BbigVes2 loci are present in 
multiple B. bigemina strains. However, phylogenetic analysis 
of ves gene repertoires from multiple strains produced no ev- 
idence for strain- specific gene family expansions that might 
suggest structural differentiation of ves repertoires within 
a species (data not shown). Thus, while ves gene repertoire 
among divergent B. bigemina strains overlaps structurally 
and phylogenetically, a sizable minority of vesl genes are 
absent from conserved genomic positions in any given in- 
stance, indicating that gene transposition is much more fre- 
quent than either gene duplication or the evolution of novel 
sequences through amino acid substitution. 

Moreover, even when genes are present at the same lo- 
cus in multiple strains, these are not always orthologous; in- 
deed, Table 2 shows that the majority of vesl in B. bovis and 
B. bigemina are not orthologous when conserved in posi- 
tion. Conversely, 75% of conserved smORF loci contain or- 
thologous sequences, and likewise 79.5-85.1% of BbigVes2 
and 86.3% of BdivVes2 are orthologous. Supplementary 
Figure S4 illustrates this distinction for B. bigemina vesl 
and ves2 loci across four strains, contrasting the complete 
conservation of a ves2 locus with three vesl loci for which 
micro-homology has broken down to some extent. While 
the orthologous sequences at the ves2 and upstream vesl 
loci cluster together in the global gene family phylogenies, 
sequences from the downstream vesl loci, which are not or- 
thologous, cluster apart. 

From the results above, widespread transposition of ves 
sequences is apparent. To assess its significance we adopted 
a co-phylogenetic approach, as described in Supplemen- 
tary Figure S5. Taking pairs of strains in turn, we esti- 
mated maximum likelihood phylogenies for vesl and ves2 
genes that occurred at conserved positions, excluding ves 
genes at positions that were unique to a single strain. If 
transposition has not occurred, these phylogenies should 
be identical, or at least display variations not significantly 
greater than expected by systematic error, because the genes 
would be orthologous and diversified in the common ances- 
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Figure 5. Frequency histogram showing ranked abundance of peptides detected in proteomic analysis of Babesia bigemina PR. The position of VESA1 
(light green) and VESA2 (dark green) predicted proteins are shown with filled circles. Frequency histograms of VESA-like proteins only are shown in the 
insets. 

Table 2. Percentage values for presence, absence and orthology of ves loci in Babesia reference genome sequences 



Reference 



Ves gene family N 



Strain 



% Loci 











Present 


No assembly 


Absent 


Orthologous 


B. bovis T2Bo 


Bbov Ves la 


82 


C9.1 


51.3 


22.4 


26.3 


25 




BbovVeslfi 


50 


C9.1 


54.5 


25 


20.5 


12.5 




smORF 


43 


C9.1 


55.8 


14 


30.2 


75 


B. bigemina 


BbigVesla 


78 


PR 


65.3 


6.7 


28 


23.5 


BOND 






















JG29 


66.6 


2.6 


30.8 


32.2 








BbiS3P 


69.4 


0 


30.6 


26 




BbigVeslb 


79 


PR 


40 


43.7 


16.3 


0 








JG29 


46.6 


37.5 


15.9 
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tor of the two strains concerned. If strain phylogenies are 
significantly different, then this means that, despite being 
found at conserved positions, gene sequences have changed 
since the strains originated, which we interpret as evidence 
for transposition through gene conversion. It is likely that 
the individual genes had at one time been the active locus 
for vesla/a and/or veslfi/b transcription (LAT). While the 
LAT, the genes would have undergone repetitive events of 
segmental gene conversion with formation of extensive mo- 
saics, as previously described (22). 

Congruence between strain phylogenies was assessed for 
each ves-like subfamily using permutation tests executed in 
the programme Jane 4 (69) and judged to be significant if 
P < 0.01. For each panel in Figure 6, the observed 'cost' 
of reconciling incongruence between strain phylogenies is 
marked by a vertical red line; adjacent to this is a frequency 



distribution showing the costs of reconciliation when the 
phylogenies are randomized. The distance between the ob- 
served and randomized costs is a reflection of the congru- 
ence in the trees, where they overlap congruence is likely to 
be significant. Contrary to the null expectation, Figure 6 
shows that vesl phylogenies were incongruent, (i.e. shared 
a level of topological similarity no greater than that due to 
chance), for B. bovis {Bbov Ves la, P = 0.315; Bbov Ves lb, P 
= 0.264), B. bigemina (BbigVesla, P = 0.055; BbigVeslb, 
P = 0.014) and B. divergens (P = 0.1). In contrast, strain 
phylogenies for BbigVes2, BdivVes2a and BdivVes2b were 
significantly congruent (P < 0.001). Thus, when vesl genes 
are present at the same genomic position in different strains 
they almost invariably have different sequences, whereas 
when ves2 genes are conserved in position, they are also of- 
ten conserved in sequence (i.e. are orthologous). 
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Figure 6. Comparison of event costs required to reconcile vesl and ves2 gene phylogenies. For each vesl and ves2 gene family, phylogenies were estimated for 
positionally-conserved genes, i.e. loci conserved in both the reference strain and one other strain. In the absence of recombination after the strains diverge, 
such trees should have the same topology. Significance of topological congruence is assessed through phylogenetic reconciliation using the programme 
Jane 4, whereby evolutionary events are posited to explain topological disparities between the trees. Each histogram shows the frequency distribution of 
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Where observed and randomized event costs overlap, this indicates that there is no significant agreement between the trees, which we interpret as evidence 
for recombination. P- values represent the probability of obtaining the observed cost in randomized co-phylogenies (i.e. of observed tree similarity being 
due to chance), and are mean averages taken over all cost combinations. 



vesl and ves2 genes differ in their exposure to recombination 

The transposition of vesl and, to a lesser extent, ves2 se- 
quences between loci is consistent with the idea that ectopic 
gene conversion of ves sequences is frequent and important 
in driving antigenic variation (22). Sequence mosaicism has 
been observed in BbovVesl genes and is interpreted as evi- 
dence for historical recombination (22). We sought to com- 
pare the role of recombination between vesl in different 
species, and between vesl and ves2 in the same genomes, by 
estimating the frequency of sequence mosaicism using phy- 
logenetic methods. Evidence for recombination was identi- 
fied in the form of phylogenetic incompatibility (PI) among 
characters, when sampling alignments of full-length, ves- 
like sequences. Two programs, PhiPack (70) and 3seq (71), 
were used to assess the significance of PI in multiple align- 
ments sampled from all full-length sequences (see methods). 
The metric P pi records the proportion of alignments that 



returned significant PI (i.e. P < 0.05) for any given sub- 
family. As Figure 7 shows, vesl genes in all three species 
displayed significant PI. Using 3seq, P pi tended to be in the 
0.1-0.25 range, although BdivVeslb sequences displayed PI 
much more frequently (Ppi = 0.42). Using PhiPack, Ppi was 
between 0.88-0.99 for the same gene families. The differ- 
ence between methods likely results from the use of four se- 
quences in PhiPack analyses (one more than in 3seq anal- 
yses), which increases the chances of observing PI. The 
increased P P i in BdivVeslb was not observed when using 
PhiPack. 

A clear difference in P pi is observed between vesl and 
ves2 subfamilies using both programs. In co-phylogenetic 
analyses, BbigVes2 and BdivVes2b showed little evidence 
of transposition or in situ gene conversion. This is corrob- 
orated in Figure 7, where P pi is 0.001 and 0.033 respec- 
tively using 3seq. SmORF and BdivVes2a co-phylogenies 
were more equivocal but these two subfamilies also offer 
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Figure 7. Evidence for recombination among gene copies for ves-like gene families in Babesia spp. using two programs: 3seq (A) and PhiPack (B). The 
proportion of sub -alignments showing significant phylogenetic incompatibility (P p i) is shown for vesl and ves2 gene families, with bars shaded by species 
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less evidence for recombination than among vesl sequences 
(P pi = 0.007 and 0.128 respectively). When we compare 
the actual P- values associated with 3seq analyses of all se- 
quence triplets (Figure 8), these are significantly lower for 
each ves2/smORF subfamily relative to vesl genes in the 
same genome. The exception here is Bdiv Ves2a, which has 
a comparable P p i to Bdiv Vesl but a lower mean P- value. 
In summary, those sub-families showing greater congruence 
in co-phylogeny analyses tend to show significantly less PI 
in analysis of sequence mosaics, and so vesl and ves2 sub- 
families display a markedly different evolutionary dynamic 
that is consistent across species. Although the function(s) 
of these proteins are not yet known, they are unlikely to be 
exposed on the infected erythrocyte surface. Immune selec- 
tion pressure may have been greatly reduced following their 
derivation from vesl genes as a result, and they therefore 
show greater co-phylogeny. 

The position of recombination breakpoints within ves- 
like sequence alignments could indicate low-complexity 
structural motifs that facilitate recombination by promot- 
ing strand annealing between homologous sequences (32). 
The most likely 5' and 3' breakpoints for recombinant tracts 
detected by 3seq can be plotted onto the sequence align- 
ments for each ves-like gene family (Supplementary Fig- 
ure S6). While there is no consistent pattern between fami- 
lies and sometimes no preference is shown at all, it appears 



that some breakpoints may be associated with repetitive re- 
gions that are unique to each ves gene family. For exam- 
ple, in the case of Bbov Vesl a, the 5' breakpoint between 800 
and 900 bp is immediately preceded by sequences encoding 
an alanine-rich repeat region and another domain contain- 
ing cysteine-based repeats. The 3' breakpoint between 2700 
and 2800 bp is immediately preceded by sequences encoding 
the largest glycine/ly sine-rich repeat region in the molecule. 
Similarly, BbigVesla has a dominant breakpoint between 
1250 and 1500 bp that immediately precedes sequences en- 
coding a low-complexity region dominated by serine, thre- 
onine and proline residues. However, such regions are com- 
mon in these molecules and so these associations will re- 
quire further validation. 



DISCUSSION 

In common with other pathogens, the most labile features 
of Babesia genomes are those genes encoding surface anti- 
gens. Vesl genes, and the shortened forms derived from 
them, display the greatest divergence in species comparisons 
and the most polymorphism in strain comparisons. These 
genes are also found consistently in genomic regions other- 
wise depleted of coding sequences and prone to rearrange- 
ment. At each level of inspection, our analyses have em- 
phasized that vesl -like sequences are transposed between 
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locations: the variation in presence or absence of individ- 
ual loci, the non-orthology of sequences occupying homol- 
ogous positions in different species and strains, the incon- 
gruence of vesl co-phylogenies, and the abundant evidence 
for recombination between vesl sequences. In each respect, 
ves2 and smORF sequences are different; they are more con- 
sistently conserved in position, their co-phylogenies have 
greater congruence and their sequence alignments present 
significantly less evidence for sequence mosaicism. 

Therefore, there is a fundamental difference in evolution- 
ary dynamic between the canonical vesl, the evolution of 
which is perhaps determined by its particular requirement 
for transposition, and ves2, which evolves in a manner more 
consistent with a basic birth-death model, albeit with sub- 
stantially more gene duplication than most gene families. 

The genomic basis for immune evasion in B. bovis 
(19,20,22) is, in several respects, analogous to the expres- 
sion of PfEMPl from var genes in the malaria parasite Plas- 
modium falciparum (2), to the Variant Surface Glycopro- 
tein (vsg) in the African trypanosome Trypanosoma brucei 
(75); and to the Major Surface Glycoproteins (msg) in the 
ascomycete Pneumocystis carinii (76). The convergent evo- 
lution of immune evasion functions among these proteins 
and their regulatory milieux in diverse organisms has been 
recognized (77-80) and perhaps this reflects their shared 
experience of a relatively conservative vertebrate immune 
response. Typically, an abundant gene family encoding a 
repetitive major surface protein is sited in labile regions of 
the genome; we have seen how ves-like genes are typically 
distributed at the most dynamic sites in the Babesia genome. 
Expression of the surface antigen is commonly restricted to 
a specific locus, while most or all other gene copies are si- 
lenced. Available evidence suggests this is the case for vesl 
genes in B. bovis (25) and perhaps ves genes among Babesia 
spp. in general (this study). Periodic switching of a new vari- 



ant antigen gene, in whole or in part, into the expression 
site from silent loci, or epigenetic transcriptional switch- 
ing among members of the multigene family, then results 
in antigenic variation on the parasite (or IRBC) surface. 

The common role of gene conversion in antigenic vari- 
ability across diverse organisms has also been highlighted 
(32,34,81). By detecting the signature of past recombina- 
tion in multiple sequence alignments, recombination has 
been identified as a principal evolutionary force affecting 
gene family structure of var genes in P falciparum (82-84), 
vsg in T. brucei (30,31,85) and msg in P carinii (86). How- 
ever, recombination of var genes appears to be meitoic, and 
gene conversion is not a demonstrated mechanism during 
antigenic variation within an individual host. Repetition 
within genes or in non-coding flanking regions is thought 
to facilitate recombination, suggesting that genome struc- 
ture is adapted for increasing the mutability of variant anti- 
gen genes (4,87). Analogous motifs may exist in Babesia; 
the well-conserved C-terminal domains of vesl genes could 
plausibly act as annealing points during cross-over, while 
our recombination analyses have located breakpoints to 
repetitive parts of vesl genes. If so, they are unique to a given 
species, since they fall within regions that are not widely 
conserved among species. 

What is recognized less is that these contingency gene 
families have similar phylogenetic patterns, i.e. variant anti- 
gen gene repertoires in related species do not overlap. There 
is no orthology between the vesl in different Babesia spp. 
(Figure 4), just as there are no orthologous vsg in differ- 
ent African trypanosome genomes (31,88), orthologous var 
in comparisons of P falciparum and P reichenowi (35), or 
indeed among P falciparum strains (82). The mutual exclu- 
sivity of repertoires in these situations indicates rapid gene 
turnover; the frequent gain and loss of gene copies after 
speciation, resulting in the substitution of shared, ancestral 
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characters by unique, derived ones. Hence, the rapid loss of 
phylogenetic diversity is a sampling effect; after speciation, 
species-specific sequence types are constantly created while 
ancestral types are a limited number, but all can be 'over- 
written' by gene conversion. This likely explains the lack of 
cysteine-rich CKRD and VDCS domains encoded by B. bo- 
vis ves genes in those of B. bigemina and B. divergens. More- 
over, a lack of these domains may explain the lack of cytoad- 
hesive behaviour in B. bigemina and B. divergens, whereas B. 
bovis uses VESA1 as a highly variable cytoadhesion ligand 
(89). In addition to immune evasion, antigenic variation of 
VESA1 could provide for stochastic sampling of a very large 
molecular space on the endothelial cell surface for a com- 
plementary receptor. 

Another feature of the rapid turnover of ves-like struc- 
tures after speciation is the evolution of ves2 from the N- 
terminal regions of vesl, which we believe has occurred 
on independent occasions. Although the functions of ves2 
and smORFare open questions, there are good reasons for 
believing that they are distinct from vesl. First, we have 
shown that vesl and ves2 /smORF behave differently over 
evolutionary timescales. Second, they have radically differ- 
ent protein structures, and whereas VESA1 is membrane- 
bound in B. bovis (42) and we presume in B. bigemina and 
B. divergens, VESA2 and SMORF are strongly predicted 
to be secreted. Third, the two sequence classes have differ- 
ent expression profiles in B. bigemina PR, suggesting the 
employment of unlinked and perhaps different regulatory 
strategies. 

We can speculate on ves2 function through comparison 
with related Apicomplexan parasites. In Plasmodium spp., 
secretion is thought to be instrumental in remodeling host 
erythrocytes after the parasite has invaded. Altering ery- 
throcyte structure is necessary for parasite metabolism and 
key survival strategies such as sequestration and antigenic 
variation (90-93). The secreted proteomes of Plasmodium 
spp. contain not only key effectors, for example, the PfEMP- 
1 protein in P falciparum, but also diverse chaperones and 
cofactors thought to facilitate the expression of key effec- 
tors (94,95). So, we can imagine VESA2 being secreted into 
the cytoplasm of the IRBC as part of cocktail of parasite 
proteins that alter host cell structure. This might be in sup- 
port of VESA 1 expression in the IRBC plasma membrane 
directly, or to antagonize host proteins with the effect of 
making the IRBC more amenable to parasite survival, as 
happens when Theileria spp., a sister genus to Babesia, in- 
fect white blood cells (96-97). Alternatively, or perhaps ad- 
ditionally, since VESA2 proteins are expressed abundantly, 
they may serve to obfuscate the immune response. This 
could occur by creating a 'smokescreen' effect, inducing fo- 
cus of dendritic cells upon antigens that are inaccessible 
and largely insensitive targets, analogous to the secretion 
of subtelomere-associated variant surface proteins (SVSP) 
into the host cytoplasm by Theileria spp. It has been sug- 
gested that presentation of SVSP peptides by class 1 MHC 
distracts the immune response from exposed antigens on 
the parasite surface (97) and, similarly, VESA2 may pro- 
mote antibodies that have no reactivity to VESA1 (although 
without presentation of MHC, since Babesia spp. only in- 
fect erythrocytes). It remains to be seen if VESA2 are im- 



munogenic, and if antibodies to VESA1 and VESA2 are 
cross-reactive. 

In this study, we have confirmed that the genomic ba- 
sis for antigenic variation in B. bovis is conserved through- 
out the genus, but with substantial, species-specific diver- 
gence in protein structure and gene repertoire. In fact, while 
Babesia genomes are broadly conserved in other respects, 
the evolution of vesl genes is a record of constant change 
and rapid turnover, which we suggest reflects the endur- 
ing challenge of host-parasite interactions over millions 
of years. In addition, ves-like genes with shortened pre- 
dicted proteins have evolved from canonical vesl on inde- 
pendent occasions. These have distinct expression profiles 
(in B. bigemina at least) and evolutionary dynamics, sug- 
gesting a distinct function, though probably still involved in 
antigenic variation, or at the host-parasite interface. From 
previous observations of the B. bovis genome, it is thought 
that recombination between vesl gene copies could be cru- 
cial both to the mechanism of antigenic variation during in- 
fection, and to the generation of antigenic diversity in par- 
asite populations (22,26). These results confirm the princi- 
pal role of recombination across the genus and support the 
emerging view that genomic architecture facilitates recom- 
bination to both promote switching and generates antigenic 
diversity (98). As with var and vsg, the roles of recombina- 
tion, rapid gene turnover and structural innovation in the 
evolution of variant antigens is affirmed in vesl phylogeny. 
The association of this phylogenetic pattern with variant 
antigen function in diverse parasite genomes is testament 
to the convergence in both structure and mechanism by di- 
verse pathogens to combat vertebrate immunity. 
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